SpikingVTG
======== 
This repository implements the method described in the paper "SpikingVTG: A Spiking Detection Transformer for Video Temporal Grounding".

Video Temporal Grounding (VTG) aims to retrieve precise temporal segments in a video conditioned on natural language queries. Unlike conventional neural frameworks that rely heavily on computationally expensive dense matrix multiplications, Spiking Neural Networks (SNNs), previously underexplored in this domain, offer a unique opportunity to tackle VTG tasks through bio-plausible spike-based communication and an event-driven, accumulation-based computational paradigm. We introduce SpikingVTG, a multi-modal spiking detection transformer designed to harness the computational simplicity and sparsity of SNNs for VTG tasks. Leveraging the temporal dynamics of SNNs, our model introduces a Saliency Feedback Gating (SFG) mechanism that assigns dynamic saliency scores to video clips and applies multiplicative gating to highlight relevant clips while suppressing less informative ones. SFG enhances performance and reduces computational overhead by minimizing neural activity. We analyze the layer-wise convergence dynamics of the SFG-enabled model and apply implicit differentiation at equilibrium to enable efficient, BPTT-free training. To improve generalization and maximize performance, we enable knowledge transfer by optimizing a Cos-L2 representation matching (CLRM) loss that aligns the layer-wise representations and attention maps of a non-spiking teacher with those of our student SpikingVTG. Additionally, we present Normalization-Free (NF)-SpikingVTG, which eliminates non-local operations like softmax and layer normalization, and an extremely quantized 1-bit NF-SpikingVTG variant for potential deployment on edge devices. Our models achieve competitive results on QVHighlights, Charades-STA, TACoS, and YouTube Highlights, establishing a strong baseline for multi-modal spiking VTG solutions.

Installation
============
Run the command below to install the required packages (**using python3**).
```bash
pip install -r requirements.txt
```

## Overall Repository Structure

```
ide_methods/    
    snn_vtg_modules.py                    Model components for SpikingVTG
    snn_vtg_modules_no_norm.py            Model components for NF-SpikingVTG
    snn_vtg_modules_quantized_no_norm.py  Model components for 1-bit NF-SpikingVTG
    snn_module.py                         Implementation of the network dynamics
    snnide_vtg_multilayer_module.py       Code for training the SNN

main/
    training_spiking_kd.py                Code for CLRM knowledge distillation (Stage 1)
    train_spiking_output.py               Code for fine-tuning (Stage 2)
    
model/
    spikingVTG.py                         Code where model is initialized
    
spiking_student_model/
    config.py                             Configuration of the student
    pytorch_model.bin                     Distilled model 

data/                                     Store datasets in this folder
```

Multi-stage Training Pipeline
=============================

**Stage 1:**
In this stage, a pre-trained UniVTG model serves as the "teacher", and the SpikingVTG "student" is trained against it by optimizing the CLRM (Cos-L2 representation matching) loss. The dataset used is QVHighlights.

(a) Download the pre-trained UniVTG model released with the UniVTG paper, and preprocess the inputs as described there.

(b) Create the student model configuration in a separate folder (`spiking_student_model/config.py`).

(c) Run CLRM as described in the paper, using `training_spiking_kd.py` with the arguments listed below.
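
To make the objective in step (c) concrete, here is a minimal sketch of a Cos-L2 style matching term in plain Python. It assumes the loss is an unweighted sum of a cosine-distance term and a mean-squared-error term between a student representation and the corresponding teacher representation; the exact formulation and weighting in the paper may differ. The function names `cos_l2_loss` and `layerwise_loss` are illustrative and are not taken from this repository, which uses PyTorch tensors rather than Python lists.

```python
import math


def cos_l2_loss(student, teacher, eps=1e-8):
    """Cosine distance plus mean squared error between two equal-length vectors.

    Assumption: both terms are weighted equally; this is an illustration,
    not the repository's implementation.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher vectors must have the same length")
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    # Cosine term penalizes a difference in direction.
    cosine_term = 1.0 - dot / (norm_s * norm_t + eps)
    # L2 term penalizes a difference in magnitude as well.
    l2_term = sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    return cosine_term + l2_term


def layerwise_loss(student_layers, teacher_layers):
    """Sum the Cos-L2 term over pairs of matching layers."""
    return sum(cos_l2_loss(s, t) for s, t in zip(student_layers, teacher_layers))
```

The cosine term only vanishes when the two vectors point in the same direction, while the L2 term also penalizes a mismatch in scale, which is why the two are combined.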


```bash
python training_spiking_kd.py \
--dset_type mr \
--dset_name qvhighlights \
--clip_length 2 \
--gpu_id 0 \
--device 0 \
--exp_id qvhl \
--model_id univtg_original \
--v_feat_types slowfast_clip \
--t_feat_type clip \
--ctx_mode video_tef \
--train_path data/qvhighlights/metadata/qvhighlights_train.jsonl \
--eval_path data/qvhighlights/metadata/qvhighlights_val.jsonl \
--eval_split_name val \
--v_feat_dirs data/qvhighlights/vid_slowfast data/qvhighlights/vid_clip \
--v_feat_dim 2816 \
--t_feat_dir data/qvhighlights/txt_clip \
--t_feat_dim 512 \
--dim_feedforward 1024 \
--input_dropout 0.0 \
--dropout 0 \
--droppath 0.0 \
--bsz 32 \
--eval_bsz 4 \
--n_epoch 10 \
--num_workers 16 \
--lr 0.0001 \
--lr_drop 80 \
--lr_warmup 10 \
--wd 0.0001 \
--enc_layers 4 \
--hidden_dim 1024 \
--resume saved_non_spiking_models/qvhl_pt/model_best.ckpt           
```
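
Before launching a long run, it can help to confirm that every input named in the command above is actually present. The following sketch checks the paths copied from the Stage 1 command, relative to the directory you launch from. The helper name `missing_paths` and the `REQUIRED_PATHS` list are hypothetical and are not part of this repository.

```python
from pathlib import Path

# Paths copied from the Stage 1 command; adjust them if your data lives elsewhere.
REQUIRED_PATHS = [
    "data/qvhighlights/metadata/qvhighlights_train.jsonl",
    "data/qvhighlights/metadata/qvhighlights_val.jsonl",
    "data/qvhighlights/vid_slowfast",
    "data/qvhighlights/vid_clip",
    "data/qvhighlights/txt_clip",
    "saved_non_spiking_models/qvhl_pt/model_best.ckpt",
]


def missing_paths(root=".", required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing inputs:")
        for path in missing:
            print("  " + path)
    else:
        print("All Stage 1 inputs found.")
```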


**Stage 2:** In this stage the student model is fine-tuned. The student model obtained from CLRM should be stored in `spiking_student_model/`. The hyperparameters below are for the QVHighlights dataset.


```bash
python train_spiking_output.py \
--dset_type mr \
--dset_name qvhighlights \
--clip_length 2 \
--gpu_id 0 \
--device 0 \
--exp_id qvhl \
--model_id univtg_original \
--v_feat_types slowfast_clip \
--t_feat_type clip \
--ctx_mode video_tef \
--train_path data/qvhighlights/metadata/qvhighlights_train.jsonl \
--eval_path data/qvhighlights/metadata/qvhighlights_val.jsonl \
--eval_split_name val \
--eval_epoch 1 \
--v_feat_dirs data/qvhighlights/vid_slowfast data/qvhighlights/vid_clip \
--v_feat_dim 2816 \
--t_feat_dir data/qvhighlights/txt_clip \
--t_feat_dim 512 \
--dim_feedforward 1024 \
--input_dropout 0.5 \
--dropout 0 \
--droppath 0.1 \
--bsz 32 \
--eval_bsz 8 \
--n_epoch 200 \
--num_workers 16 \
--lr 0.0001 \
--lr_drop 80 \
--lr_warmup 10 \
--wd 0.0001 \
--use_cache 1 \
--enc_layers 4 \
--main_metric MR-full-R1@0.7-key \
--nms_thd 0.7 \
--max_before_nms 1000 \
--easy_negative_only 1 \
--b_loss_coef 10 \
--g_loss_coef 10 \
--eos_coef 0.1 \
--f_loss_coef 10 \
--s_loss_intra_coef 0.1 \
--s_loss_inter_coef 0.1 \
--round_multiple -1 \
--eval_mode add \
--hidden_dim 1024 \
--resume saved_non_spiking_models/qvhl_pt/model_best.ckpt
```
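
Both training commands pass `--v_feat_dim 2816` together with `--v_feat_types slowfast_clip`. The sketch below shows where that number plausibly comes from, assuming SlowFast features are 2304-dimensional and CLIP features are 512-dimensional. Those two widths are assumptions based on the commonly used feature extractors and are not stated in this README; the helper `video_feature_dim` is hypothetical.

```python
# Assumed per-clip feature widths (not stated in this README).
FEATURE_DIMS = {"slowfast": 2304, "clip": 512}


def video_feature_dim(v_feat_types):
    """Sum the widths of the feature types named in a string such as 'slowfast_clip'."""
    return sum(FEATURE_DIMS[name] for name in v_feat_types.split("_"))


print(video_feature_dim("slowfast_clip"))  # 2816
```

If you train with a different combination of video features, `--v_feat_dim` needs to change accordingly, and `--v_feat_dirs` must list one directory per feature type.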

Stage 3 and Stage 4 both involve fine-tuning after architectural changes. The steps are explained below.


**Stage 3:** To train NF-SpikingVTG, edit `model/spikingVTG.py` so that it imports the model components from `snn_vtg_modules_no_norm` instead of `snn_vtg_modules`. Then run the fine-tuning script (`train_spiking_output.py`).

**Stage 4:** To train 1-bit NF-SpikingVTG, edit `model/spikingVTG.py` so that it imports the model components from `snn_vtg_modules_quantized_no_norm` instead of `snn_vtg_modules`. Then run the fine-tuning script (`train_spiking_output.py`).
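
Stages 3 and 4 differ from Stage 2 only in which components module is imported, so the choice can be pictured as a small lookup. The sketch below uses the file names from the repository structure listing. The variant keys, the helpers `module_for_variant` and `load_variant`, and the assumption that `ide_methods` is importable as a package are all hypothetical; the repository itself expects you to edit the import by hand.

```python
import importlib

# File names come from the repository structure listing; the variant keys
# ("spikingvtg", "nf", "nf_1bit") are hypothetical labels for this sketch.
VARIANT_MODULES = {
    "spikingvtg": "snn_vtg_modules",
    "nf": "snn_vtg_modules_no_norm",
    "nf_1bit": "snn_vtg_modules_quantized_no_norm",
}


def module_for_variant(variant):
    """Return the name of the module holding the components for `variant`."""
    try:
        return VARIANT_MODULES[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(VARIANT_MODULES)}"
        ) from None


def load_variant(variant, package="ide_methods"):
    """Import the selected module.

    Assumption: `package` is importable from the current working directory.
    """
    return importlib.import_module(f"{package}.{module_for_variant(variant)}")
```

For example, `module_for_variant("nf")` returns `"snn_vtg_modules_no_norm"`, the module used in Stage 3.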

